tidyverse I: dplyr; gapminder
tidyverse II: readr, ggplot2; Public Data, WDI, WIR, etc.
tidyverse III: tidyr, etc.; WDI, WIR, etc.
tidyverse IV; WDI, WIR, etc.
library(tidyverse)
library(gapminder)
library(maps)
library(WDI)
(df <- gapminder)
asean <- c("Brunei", "Cambodia", "Laos", "Myanmar", "Philippines", "Indonesia", "Malaysia", "Singapore")
df %>% filter(country %in% asean) %>%
ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + geom_point()
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) +
geom_point() + coord_trans(x = "log10", y = "identity")
\(\log_{10}{100}\) = 2, \(\log_{10}{1000}\) = 3, \(\log_{10}{10000}\) = 4
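The values above explain why a log10 axis places 100, 1000 and 10000 at equal distances. A quick check in base R, offered only as a small sketch (the vector name `gdp_values` is made up for illustration):

```r
# Powers of ten become equally spaced after a log10 transform
gdp_values <- c(100, 1000, 10000)
log_values <- log10(gdp_values)

log_values        # 2 3 4
diff(log_values)  # 1 1: each tenfold increase is one step on the axis
```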
df_wdi <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
df_wdi
df_wdi_extra <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"),
extra = TRUE
)
df_wdi_extra
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
1. Generate questions about your data
2. Search for answers by visualising, transforming, and/or modeling your data
3. Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
WDI(country = "all",
indicator = "NY.GDP.PCAP.KD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL)
The indicator can be given as a named vector to rename the resulting column, e.g. c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')
library(WDI)
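To see what a named indicator vector does without calling the World Bank API, the following sketch applies the same new_name = "old_name" mapping to a toy data frame. The data frame `toy` and its values are hypothetical, and `rename(all_of(...))` only imitates the renaming effect; it is not how WDI implements it internally.

```r
library(dplyr)

# Hypothetical stand-in for a WDI result whose column is named by the indicator code
toy <- tibble(
  country = c("A", "B"),
  year = c(2020, 2020),
  BI.PWK.PRVS.FE.ZS = c(41.2, 38.5)  # made-up values, for illustration only
)

# A named vector maps new_name = "old_name"
indicator <- c(women_private_sector = "BI.PWK.PRVS.FE.ZS")

# Rename the code column to the friendlier name
toy_renamed <- toy %>% rename(all_of(indicator))
names(toy_renamed)
```

The same named-vector style is what L20-style calls such as `indicator = c(lifeExp = "SP.DYN.LE00.IN", ...)` rely on.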
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", cache = NULL)
WDIsearch(string = "population",
field = "name", short=FALSE, cache = NULL)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", short = FALSE, cache = NULL)
WDIsearch(string = "gdp",
field = "name", short = TRUE, cache = NULL)
WDIbulk downloads the zip file offered under Bulk Downloads on the WDI site and returns a list containing 6 data frames: Data, Country, Series, Country-Series, Series-Time, FootNote.
timeout: integer, maximum number of seconds to wait for the download
wdi <- WDIbulk(timeout = 600)
wdi$Data
wdi$Country
wdi$Series
wdi$`Country-Series`
wdi$`Series-Time`
wdi$FootNote
Download an updated list of available WDI indicators from the World Bank website. Returns a list for use in the WDIsearch function.
wdi_cache <- WDIcache()
Downloading all series information from the World Bank website can
take time. The WDI package ships with a local data object with
information on all the series available on 2012-06-18. You can update
this database by retrieving a new list using WDIcache, and
then feeding the resulting object to WDIsearch via the
cache argument.
wdi_cache
$series
$country
List of 2 data frames
The first character matrix includes a full list of WDI series. This list is updated semi-regularly. Users can refresh the list manually using the ‘WDIcache()’ function and search in the updated list using the ‘cache’ argument.
glimpse(WDI_data)
List of 2
$ series :'data.frame': 20238 obs. of 5 variables:
..$ indicator : chr [1:20238] "1.0.HCount.1.90usd" "1.0.HCount.2.5usd" "1.0.HCount.Mid10to50" "1.0.HCount.Ofcl" ...
..$ name : chr [1:20238] "Poverty Headcount ($1.90 a day)" "Poverty Headcount ($2.50 a day)" "Middle Class ($10-50 a day) Headcount" "Official Moderate Poverty Rate-National" ...
..$ description : chr [1:20238] "The poverty headcount index measures the proportion of the population with daily per capita income (in 2011 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income below the of"| __truncated__ ...
..$ sourceDatabase : chr [1:20238] "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" ...
..$ sourceOrganization: chr [1:20238] "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of data from National Statistical Offices." ...
$ country:'data.frame': 299 obs. of 9 variables:
..$ iso3c : chr [1:299] "ABW" "AFE" "AFG" "AFR" ...
..$ iso2c : chr [1:299] "AW" "ZH" "AF" "A9" ...
..$ country : chr [1:299] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
..$ region : chr [1:299] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
..$ capital : chr [1:299] "Oranjestad" "" "Kabul" "" ...
..$ longitude: chr [1:299] "-70.0167" "" "69.1761" "" ...
..$ latitude : chr [1:299] "12.5167" "" "34.5228" "" ...
..$ income : chr [1:299] "High income" "Aggregates" "Low income" "Aggregates" ...
..$ lending : chr [1:299] "Not classified" "Aggregates" "IDA" "Aggregates" ...
WDI_data$series
WDI_data$country
WDI_data$country %>% filter(country == "Japan")
WDIsearch(string = "gdp",
field = "name", short = FALSE, cache = wdi_cache)
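Searching a cached series list is, in spirit, a case-insensitive pattern match on one column. The sketch below imitates that on a hypothetical three-row table `toy_series`; it does not call WDIsearch and is not a claim about its exact internals.

```r
library(dplyr)

# Hypothetical miniature of the `series` table held in a cache object
toy_series <- tibble(
  indicator = c("NY.GDP.PCAP.KD", "SP.POP.TOTL", "SP.DYN.LE00.IN"),
  name = c("GDP per capita (constant 2015 US$)",
           "Population, total",
           "Life expectancy at birth, total (years)")
)

# Case-insensitive match on the name column, similar in spirit to field = "name"
gdp_hits <- toy_series %>%
  filter(grepl("gdp", name, ignore.case = TRUE))
gdp_hits
```

Matching on the `indicator` column instead corresponds to searching with field = "indicator".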
Find indicators:
WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL)
WDIsearch(string = "gdp", field = "name", short = FALSE, cache = wdi_cache)
WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = NULL)
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator",
short = FALSE, cache = wdi_cache)
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator",
short = FALSE, cache = wdi_cache) %>% pull(description)
co2pcap <- WDI(country = "all", indicator = "EN.ATM.CO2E.PC", start = 1960, end = NULL, extra = TRUE, cache = wdi_cache)
co2pcap
co2pcap %>% filter(country %in% c("World", "Japan", "United States", "China")) %>%
ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) +
geom_point() + geom_line()
co2pcap %>% filter(!is.na(EN.ATM.CO2E.PC)) %>% pull(year) %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
1990 1997 2005 2005 2012 2019
co2pcap %>% filter(country %in% c("World", "Japan", "United States", "China"), year %in% 1990:2019) %>%
ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) +
geom_point() + geom_line()
co2pcap %>%
filter(income != "Aggregates", year == 2019) %>%
ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) +
geom_boxplot()
co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) +
geom_boxplot()
boxplot: https://vimeo.com/222358034
co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
group_by(income) %>%
summarize(min = min(EN.ATM.CO2E.PC), med = median(EN.ATM.CO2E.PC), max = max(EN.ATM.CO2E.PC), IQR = IQR(EN.ATM.CO2E.PC), n = n())
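The summary above lists the numbers a boxplot is built from. As a sketch on made-up values (the vector `x` is hypothetical), the quartiles, the IQR and the usual 1.5 × IQR rule for flagging outliers can be computed by hand:

```r
# Made-up values, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- unname(quantile(x, 0.25))  # lower edge of the box
q3 <- unname(quantile(x, 0.75))  # upper edge of the box
iqr <- q3 - q1                   # height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual dots
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, median = median(x), q3 = q3, iqr = iqr)
outliers
```

Here only the value 50 lies beyond the upper fence, so it is the one point a boxplot would show separately.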
co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
filter(!income %in% c("High income", "Low income", "Lower middle income", "Upper middle income"))
co2pcap %>%
filter(income != "Aggregates", year == 2019) %>%
filter(income == "Not classified")
co2pcap %>% distinct(country)
world_map <- map_data("world")  # world polygons; requires the maps package
world_map %>% distinct(region)
world_map0 <- world_map %>%
mutate(region = case_when(region == "Macedonia" ~ "North Macedonia",
region == "Ivory Coast" ~ "Cote d'Ivoire",
region == "Democratic Republic of the Congo" ~ "Congo, Dem. Rep.",
region == "Republic of Congo" ~ "Congo, Rep.",
region == "UK" ~ "United Kingdom",
region == "USA" ~ "United States",
region == "Laos" ~ "Lao PDR",
region == "Slovakia" ~ "Slovak Republic",
region == "Saint Lucia" ~ "St. Lucia",
region == "Kyrgyzstan" ~ "Kyrgyz Republic",
region == "Micronesia" ~ "Micronesia, Fed. Sts.",
region == "Swaziland" ~ "Eswatini",
region == "Virgin Islands" ~ "Virgin Islands (U.S.)",
region == "Russia" ~ "Russian Federation",
region == "Egypt" ~ "Egypt, Arab Rep.",
region == "South Korea" ~ "Korea, Rep.",
region == "North Korea" ~ "Korea, Dem. People's Rep.",
region == "Iran" ~ "Iran, Islamic Rep.",
region == "Brunei" ~ "Brunei Darussalam",
region == "Venezuela" ~ "Venezuela, RB",
region == "Yemen" ~ "Yemen, Rep.",
region == "Bahamas" ~ "Bahamas, The",
region == "Syria" ~ "Syrian Arab Republic",
region == "Turkey" ~ "Turkiye",
region == "Cape Verde" ~ "Cabo Verde",
region == "Gambia" ~ "Gambia, The",
region == "Czech Republic" ~ "Czechia",
TRUE ~ region))
co2pcap %>% filter(income != "Aggregates", year == 2019) %>%
anti_join(world_map0, by = c("country"="region"))
world_map0 %>% anti_join(co2pcap, by = c("region"="country")) %>% distinct(region) %>% arrange(region)
world_map0 %>% left_join(iso3166, by = c("region" = "ISOname")) %>%
filter(is.na(a2)) %>% distinct(region)
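The recode-then-check workflow above can be tried on a small scale. In this sketch the tables `toy_wdi` and `toy_map` are hypothetical miniatures; `anti_join()` returns the rows of the first table that have no match in the second, which is how mismatched country names are found.

```r
library(dplyr)

# Hypothetical miniatures of the two tables
toy_wdi <- tibble(country = c("Japan", "United States", "Lao PDR"))
toy_map <- tibble(region = c("Japan", "USA", "Laos", "Antarctica"))

# Names in the WDI-like table with no partner in the map table
toy_wdi %>% anti_join(toy_map, by = c("country" = "region"))

# Recode the map names to the WDI spelling, then check again
toy_map0 <- toy_map %>%
  mutate(region = case_when(
    region == "USA" ~ "United States",
    region == "Laos" ~ "Lao PDR",
    TRUE ~ region
  ))

unmatched <- toy_wdi %>% anti_join(toy_map0, by = c("country" = "region"))
nrow(unmatched)  # 0: every WDI-like name now has a map partner

# The reverse check shows map regions without data
still_map_only <- toy_map0 %>% anti_join(toy_wdi, by = c("region" = "country"))
still_map_only
```

Running the check in both directions, as the text does, separates names that need recoding from regions that simply have no data.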
world_map <- map_data("world")
co2pcap %>% filter(income != "Aggregates", year == 2019) %>%
anti_join(world_map, by = c("country"="region"))
world_map %>% distinct(region)
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…
ggplot2 Basics
visualization
library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
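dir.create("data", showWarnings = FALSE)  # make sure the target folder exists
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx", mode = "wb")  # "wb" keeps the binary file intact on Windows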
excel_sheets("data/WIR2022s.xlsx")
Note that the sheet name of F14 has a period at the end.
df_f14 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F14.")
df_f14
\n for line break in the title.
df_f14 %>%
ggplot(aes(x = Group, y = Share)) +
geom_col()
df_f14 %>%
ggplot(aes(x = Group, y = Share)) +
geom_col(width = 0.5, fill = scales::hue_pal()(1)[1]) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)",
x = "", y = "Share of world emissions (%)")
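The two scales helpers used in this plot can be inspected on their own. This is a small sketch, assuming the Share column holds proportions between 0 and 1; the names `to_percent` and `first_hue` are introduced here only for illustration.

```r
library(scales)

# percent_format() returns a function that turns proportions into labels
to_percent <- percent_format(accuracy = 1)
labels <- to_percent(c(0.12, 0.48))
labels  # "12%" "48%"

# hue_pal() also returns a function; asking for 1 color gives the first default hue
first_hue <- hue_pal()(1)[1]
first_hue  # a hex color string
```

Passing `first_hue` to `fill` colors every bar with the same default ggplot2 hue instead of the default dark grey.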
width = 0.5: width of bars
fill = scales::hue_pal()(1)[1]: hue scale
scale_y_continuous(labels = scales::percent_format(accuracy = 1)): percent format
labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)", x = "", y = "Share of world emissions (%)"): \n is for line feed
df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
df_f1_rev %>%
ggplot(aes(x = cat, y = value, fill = group)) +
geom_col(position = "dodge")
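The step that creates df_f1_rev is not shown in the text. One common way to obtain the columns cat, group and value used above is to reshape a wide table with tidyr::pivot_longer(). The sketch below uses a hypothetical table `toy_f1` with made-up column names and values; the real layout of the "data-F1" sheet may differ.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical wide table; real column names and values in "data-F1" may differ
toy_f1 <- tibble(
  cat = c("Income", "Wealth"),
  `Group A` = c(0.1, 0.2),  # made-up numbers, for illustration only
  `Group B` = c(0.3, 0.4)
)

# One row per (cat, group) pair, with the number in `value`
toy_f1_rev <- toy_f1 %>%
  pivot_longer(cols = -cat, names_to = "group", values_to = "value")

# Same plot specification as in the text, applied to the toy data
p <- toy_f1_rev %>%
  ggplot(aes(x = cat, y = value, fill = group)) +
  geom_col(position = "dodge")
```

With position = "dodge", the bars of each group sit side by side within every category instead of being stacked.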
ggplot2
WDI and ggplot2
Files: a3_123456.Rmd, a3_123456.nb.html; submit a3_123456.nb.html to Moodle.
Choose at least one indicator of WDI.
Explore the data using visualization.
Observations and difficulties encountered.
Due: 2023-01-16 23:59:00. Submit your R Notebook file in Moodle (The Third Assignment). Due on Monday!